Using chemical structure in open-source chemical text mining
Abstract
A great wealth of chemical information is to be found in the literature. For example, PubMed contains on the order of 15 million abstracts, a significant proportion of which contain information about chemicals, their biological activity and reactivity. To analyse this information, it must first be extracted from the literature, a task that can be performed by computers as well as by humans. OSCAR3 is an open-source chemistry text mining tool that can find chemical names, ontology terms and experimental data in chemistry papers, biomedical abstracts and other texts [1,2]. Recent advances in recognition techniques enable recognition with precision and recall of 80% or better, adjustable to higher recall or higher precision, at a rate of about one abstract per second. A key feature of OSCAR3 is that it can produce molecular structures for the names it finds. This enables structure-based chemical informatics techniques such as substructure search and molecular similarity to be added to the repertoire of text mining methodologies, increasing both the range of information that can be extracted and the range of analyses that can be performed. Preliminary results show that it is possible to mine metabolic reactions from a corpus of cytochrome P450 abstracts. The names of substrates and reactions can be spotted using OSCAR3 and related to each other via pattern matching. In many cases, it is possible to infer the products of the reactions even though they are not stated explicitly. For example, a paper can mention "the O-demethylation of codeine". From this, it is possible to use the chemical structure of codeine, by finding the O-methyl group and removing it, to infer that the product of the reaction is morphine. Another attractive use for chemical structure in text mining relies on molecular similarity. Given the structure of a chemical, it is possible to find structurally related compounds via fingerprint-based techniques, and to correlate the occurrence of those related compounds in a corpus with the occurrence of names of cytochrome P450s, in order to make predictions about which P450s interact with the target molecule.
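The similarity-based prediction described in the abstract can be illustrated with a minimal sketch. This is not OSCAR3's implementation. It assumes that fingerprints are plain sets of "on" bit positions, that similarity is measured with the Tanimoto coefficient, and that co-occurrence counts between compound names and P450 names have already been extracted from a corpus. The function names (`tanimoto`, `predict_p450s`), the threshold, the compound labels and all counts are hypothetical and invented for illustration.

```python
from collections import Counter


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def predict_p450s(target_fp, compound_fps, cooccurrences, threshold=0.5):
    """Rank P450 names by similarity-weighted co-occurrence with related compounds.

    compound_fps:  compound name -> fingerprint (set of ints)
    cooccurrences: compound name -> {P450 name: co-occurrence count in the corpus}
    threshold:     assumed similarity cut-off for "structurally related"
    """
    scores = Counter()
    for name, fp in compound_fps.items():
        sim = tanimoto(target_fp, fp)
        if sim >= threshold:
            for enzyme, count in cooccurrences.get(name, {}).items():
                scores[enzyme] += sim * count
    return scores.most_common()


# Toy data, invented for illustration only (not real fingerprints or counts).
compound_fps = {
    "compound_a": {1, 2, 3, 4},
    "compound_b": {1, 2, 3, 9},
    "compound_c": {7, 8},
}
cooccurrences = {
    "compound_a": {"CYP2D6": 4, "CYP3A4": 1},
    "compound_b": {"CYP2D6": 2},
    "compound_c": {"CYP1A2": 5},
}
target_fp = {1, 2, 3, 5}

ranking = predict_p450s(target_fp, compound_fps, cooccurrences)
for enzyme, score in ranking:
    print(f"{enzyme}: {score:.2f}")
```

In this toy run, only the two compounds sharing most of their bits with the target pass the threshold, so only the P450 names that co-occur with them receive a score. A real pipeline would compute fingerprints with a cheminformatics toolkit from the structures that the name recognition step produces.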
Similar resources
SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents
The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to pro...
Comparing manual and automated extraction of chemical entities from documents
The chemical information landscape is changing rapidly with a yearly increase of over 1 million new compounds and more than 700,000 publications related to chemistry [1]. Exploring the chemical space covered by relevant journals and patents is a crucial step in early stage medicinal chemistry projects. Extracting chemical entities from unstructured text is a complex task and different approache...
Image-to-Structure Task by ChemReader
Chemical structure recognition software aims to extract raster images of 2D chemical structure diagrams and convert them into a standard, machine-readable chemical file format. Such software, so-called chemical OCR, can be used for mining chemical entities that appear in the scientific literature. Since traditional text-based mining methods have not yet attempted to utilize image data in documents, chemi...
OSCAR4: a flexible architecture for chemical text-mining
The Open-Source Chemistry Analysis Routines (OSCAR) software, a toolkit for the recognition of named entities and data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on reduction of surface coupling) that permits client programme...
Improvements in Optical Structure Recognition Application
We present recent improvements to the Optical Structure Recognition Application (OSRA), an open-source utility that converts images of chemical structures to a connection-table description in an established computerized molecular format. There exists a large body of chemical information which has so far remained largely inaccessible to machine data mining techniques. One of the most common ways...